Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Nov 29 04:27:09 2021"
library(MASS)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
data("Boston")
dim(Boston)
## [1] 506 14
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Boston data frame is a sample R dataset from R library MASS. The Boston data frame has 506 rows and 14 columns. This data frame contains the following columns:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
par(mfrow=c(7,2), fin=c(5,5), mar=c(2,5,2,1), las=1, bty="n")
plot(Boston$crim, ylab = "Per capita", main = "Crime Rate")
plot(Boston$zn, ylab = "lots over 25,000 sq.ft", main = "Land zoned for lots")
plot(Boston$indus, ylab = "acres per town", main = "Non-retail business acres")
plot(Boston$medv, ylab = "Value of homes", main = "Owner-occupied homes")
plot(Boston$nox, ylab = "parts per 10 million", main = "NOX concentration")
plot(Boston$rm, ylab = "number of rooms", main = "Rooms per dwelling")
plot(Boston$age, ylab = "built prior to 1940", main = "Owner-occupied units built")
plot(Boston$dis, ylab = "weighted mean", main = "Distances to employment centres")
plot(Boston$rad, ylab = "index", main = "Accessibility to highways")
plot(Boston$tax, ylab = "tax rate per 10,000", main = "Property-tax rate")
plot(Boston$ptratio, ylab = "ratio", main = "Pupil-teacher")
plot(Boston$black, ylab = "1000(Bk - 0.63)^2", main = "Proportion of blacks")
plot(Boston$lstat, ylab = "Percent", main = "Lower status")
Compared to crime rate. The higher crime rate seems to be in rented apartments, with higher property tax rate. The owners doesn’t live in the same area. Also higher NOX rate (car pollution) and closer to highway seems to be increasing the crime rate. Employment centers are far away. On the other hand the crime rate is not relevant to skin color or apartment room count.
# Make standardized dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
# Create categorical variable
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
# Drop the crime rate from old data and add new categorial
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# Divide the dataset to train and test sets.
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
dim(test)
## [1] 102 14
dim(train)
## [1] 404 14
Comparing the two summaries tells us that scaling doesn’t work on the most of the variables. When scaling, there are variables which can’t be negative. After scaling these variables are negative, because no clustering has being done.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crim)
plot(lda.fit, dimen = 2, col = classes, pch = classes, xlab = "Crime Rate", ylab = "Other factors")
lda.arrows(lda.fit, myscale = 2)
From this data we can see that accessibility to radial highways affects the crime rates the most. Nitrogen oxide concentration also has a noted link in the data for incrieasing the crime rate. Instead when proportion of residential land zones increases, the crime rates are lower.
crimeCat <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = crimeCat, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 13 1 0
## med_low 3 14 7 0
## med_high 1 1 10 1
## high 0 0 0 32
High crime rate predictions are quite consent and it seems to be more variance on the low and med_low predictions.
data("Boston")
Boston <- scale(Boston)
km <-kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)
I came up with four clusters. The reason for this is that from the charts the the clusters are not that dispersed and there are big centers from all the charts, which can be combined. With five clusters, there were one too small cluster or it was too dispersed. With three, there was even clusters, but with four there are also four quite even clusters. More is better.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$centers)
Both give the same value, but with crime, the values are colored.